This notebook does topic modelling via BERTopic on the Kaggle dataset and the LIAR dataset.

We follow https://www.youtube.com/watch?v=v3SePt3fr9g. Note that we do not remove stop words in the BERTopic approach. Topics can be merged after fitting; see https://youtu.be/uZxQz87lb84?t=1002.
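Because stop words are kept in the input, they can surface in topic names (for example `-1_trump_in_to_video` in the fake-news topics further down). A minimal sketch of two follow-up steps is given here, assuming the `update_topics` and `merge_topics` methods of a fitted BERTopic model. The wrapper names `refresh_topic_words` and `merge_similar_topics` are hypothetical and not part of BERTopic. The documents still get embedded with their stop words; only the keyword representation is recomputed.

```python
def refresh_topic_words(topic_model, docs):
    """Recompute topic keywords with English stop words excluded.

    Hypothetical helper: the embeddings and cluster assignments are left
    untouched, only the keyword representation of each topic changes.
    """
    # Imported lazily so the module loads even without scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(stop_words="english")
    topic_model.update_topics(docs, vectorizer_model=vectorizer)
    return topic_model


def merge_similar_topics(topic_model, docs, topics_to_merge):
    """Merge the given topic ids into one topic.

    Hypothetical helper: `topics_to_merge` is a list of topic ids,
    or a list of such lists to perform several merges at once.
    """
    topic_model.merge_topics(docs, topics_to_merge)
    return topic_model
```

For instance, `merge_similar_topics(topic_model_true, corpus_true, [1, 2])` would combine topics 1 and 2. Topic ids may be renumbered after a merge, so it is worth calling `get_topic_info()` again before relying on a specific id.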

Note that by default, BERTopic uses sentence-transformers/all-MiniLM-L6-v2 to embed each document as a dense $384$-dimensional vector, and then reduces the dimensionality with UMAP (the default). PCA or truncated SVD can be substituted for UMAP, or the reduction step can be skipped entirely.
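As a rough sketch of how the reduction step could be swapped, the function below builds a model for each of the four options. It assumes that BERTopic accepts any reducer through its `umap_model` argument and that `bertopic.dimensionality.BaseDimensionalityReduction` acts as a pass-through, as described in the BERTopic documentation. `make_topic_model` and its `reduction` argument are hypothetical names, and the UMAP settings are meant to mirror the documented defaults rather than a tuned choice.

```python
REDUCTION_CHOICES = ("umap", "pca", "svd", "none")


def make_topic_model(reduction="umap", n_components=5):
    """Build a BERTopic model with the chosen dimensionality-reduction step."""
    if reduction not in REDUCTION_CHOICES:
        raise ValueError(
            f"reduction must be one of {REDUCTION_CHOICES}, got {reduction!r}"
        )

    # Heavy imports are deferred so the argument check above runs first.
    from bertopic import BERTopic

    if reduction == "umap":
        from umap import UMAP
        reducer = UMAP(n_neighbors=15, n_components=n_components,
                       min_dist=0.0, metric="cosine")
    elif reduction == "pca":
        from sklearn.decomposition import PCA
        reducer = PCA(n_components=n_components)
    elif reduction == "svd":
        from sklearn.decomposition import TruncatedSVD
        reducer = TruncatedSVD(n_components=n_components)
    else:
        # Pass-through object: clustering then runs on the raw embeddings.
        from bertopic.dimensionality import BaseDimensionalityReduction
        reducer = BaseDimensionalityReduction()

    # The reducer is passed as `umap_model` whatever its actual type.
    return BERTopic(embedding_model="all-MiniLM-L6-v2", umap_model=reducer)
```

Calling `make_topic_model("pca").fit_transform(corpus_true)` would then follow the same pattern as the cells below, with PCA in place of UMAP.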

Please download the Kaggle dataset and the LIAR dataset first. They are not bundled here because I don't yet know how to handle big files on GitHub.

Read in datasets

In [1]:
from bertopic import BERTopic
import pandas as pd

# Read in Kaggle titles only: usecols=[0] keeps just the first column ('title') of each file.
# True and fake titles are kept in two separate lists rather than in one labelled dataframe.
kaggle_df_true = pd.read_csv('./kaggle_dataset/True.csv', usecols=[0])
kaggle_df_fake = pd.read_csv('./kaggle_dataset/Fake.csv', usecols=[0])
corpus_true = kaggle_df_true['title'].tolist()
corpus_fake = kaggle_df_fake['title'].tolist()

BERTopic modeling

In [2]:
topic_model_true = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics_true, probs_true = topic_model_true.fit_transform(corpus_true)
In [3]:
topic_model_fake = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics_fake, probs_fake = topic_model_fake.fit_transform(corpus_fake)
In [4]:
topic_model_true.get_topic_info()
Out[4]:
     Topic  Count  Name
0       -1   5998  -1_korea_trump_north_white
1        0    461  0_tax_reform_corporate_rate
2        1    419  1_brexit_uk_eu_britain
3        2    343  2_china_xi_chinese_graft
4        3    340  3_iran_nuclear_deal_sanctions
...    ...    ...  ...
342    341     10  341_malaysia_beer_festival_suspected
343    342     10  342_deportation_deportations_raids_reprieves
344    343     10  343_smuggling_migrants_brave_macedonian
345    344     10  344_secretary_army_fanning_nominate
346    345     10  345_felons_virginia_voting_restoring

347 rows × 3 columns

In [5]:
topic_model_fake.get_topic_info()
Out[5]:
     Topic  Count  Name
0       -1   8173  -1_trump_in_to_video
1        0    535  0_melania_women_ivanka_sexual
2        1    489  1_students_college_school_student
3        2    381  2_obama_barack_president_speech
4        3    294  3_black_racist_supremacists_white
...    ...    ...  ...
414    413     10  413_obstruction_niece_fired_grassroots
415    414     10  414_rioter_want_protester_priceless
416    415     10  415_100_interviewer_express_disgust
417    416     10  416_cover_globe_boston_runs
418    417     10  417_sue_amendment_suing_tweet

419 rows × 3 columns

According to the BERTopic documentation, the outliers (topic -1) can be assigned to the existing topics as follows:

"Second, after training our BERTopic model, we can assign outliers to topics by making use of the .reduce_outliers function in BERTopic. An advantage of using this approach is that there are four built in strategies one can choose for reducing outliers. Moreover, this technique allows the user to experiment with reducing outliers across a number of strategies and parameters without actually having to re-train the topic model each time. The following is a minimal example of how to use this function:"
In [6]:
# Reduce outliers
new_topics_true = topic_model_true.reduce_outliers(corpus_true, topics_true)
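Note that `reduce_outliers` only returns a new list of labels, so the fitted model and anything derived from it (topic info, bar charts) still use the old assignment. The sketch below shows one way to write the labels back with `update_topics` and to check how many documents remain in topic -1. The helper names `outlier_share` and `reassign_outliers` are hypothetical, and the default `strategy="distributions"` is an assumption based on the BERTopic documentation, which also lists `"probabilities"`, `"c-tf-idf"` and `"embeddings"`.

```python
def outlier_share(topics):
    """Fraction of documents labelled -1 (BERTopic's outlier topic)."""
    if not topics:
        return 0.0
    return sum(1 for topic in topics if topic == -1) / len(topics)


def reassign_outliers(topic_model, docs, topics, strategy="distributions"):
    """Reduce outliers, then store the new labels in the model.

    Hypothetical helper wrapping two BERTopic calls.
    """
    new_topics = topic_model.reduce_outliers(docs, topics, strategy=strategy)
    # Recompute the topic representations from the new assignment.
    topic_model.update_topics(docs, topics=new_topics)
    return new_topics
```

Comparing `outlier_share(topics_true)` with `outlier_share(new_topics_true)` gives a quick sense of how much a strategy changed, without retraining the model.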
In [7]:
topic_model_true.visualize_barchart(width=180, height=400, top_n_topics=10, n_words=10)
In [8]:
topic_model_fake.visualize_barchart(width=180, height=400, top_n_topics=10, n_words=10)
In [9]:
# Pair each title with its topic, then print the first 16 titles of topic 4:
topic_true_df = pd.DataFrame({"topic": topics_true, "document": corpus_true})
topic_true = topic_true_df[topic_true_df.topic == 4]
# head(16) avoids an IndexError if the topic has fewer than 16 titles.
for title in topic_true['document'].head(16):
    print(title)
U.S. calls Myanmar moves against Rohingya 'ethnic cleansing'
U.S. hopes to pressure Myanmar to permit Rohingya repatriation
U.S. Congress members decry 'ethnic cleansing' in Myanmar; Suu Kyi doubts allegations
Myanmar operation against Rohingya has 'hallmarks of ethnic cleansing', U.S. Congress members say
U.S. lawmakers target Myanmar military with new sanctions
Tillerson tells Myanmar army chief U.S. concerned about reported atrocities
U.S. weighs calling Myanmar's Rohingya crisis 'ethnic cleansing'
U.S. officials will not label treatment of Rohingya as 'ethnic cleansing'
U.S. says holds Myanmar military leaders accountable in Rohingya crisis
Lawmakers urge U.S. to craft targeted sanctions on Myanmar military
Senators urge Trump administration to act on Myanmar Rohingya
Exclusive: Overruling diplomats, U.S. to drop Iraq, Myanmar from child soldiers' list
Obama announces lifting of U.S. sanctions on Myanmar
Exclusive: U.S. to renew most Myanmar sanctions with changes to aid business
Senate unanimously approves Myanmar ambassador nominee
Senate panel approves Myanmar nominee
In [10]:
# Pair each title with its topic, then print the first 16 titles of topic 4:
topic_fake_df = pd.DataFrame({"topic": topics_fake, "document": corpus_fake})
topic_fake = topic_fake_df[topic_fake_df.topic == 4]
# head(16) avoids an IndexError if the topic has fewer than 16 titles.
for title in topic_fake['document'].head(16):
    print(title)
 Bad News For Trump — Mitch McConnell Says No To Repealing Obamacare In 2018
 Maine Voters Tell Trump To Go F*ck Himself, Expand Medicaid Through Obamacare
 The Numbers Are In: States, Insurers Literally Say Obamacare Trainwreck Is TRUMP’S Fault
 Trump’s Press Secretary Falls Apart, Exposes His Lie About Obamacare Vote (VIDEO)
 Trumpcare Is Officially Dead, Senator Collins Confirms She’s Voting No
 WATCH: GOP Senator Yawns As Disabled Healthcare Protesters Are Being Dragged Away By Cops
 Trump’s Making It Harder To Sign Up For Obamacare On Purpose, Even If The GOP Doesn’t Pass Anything
 Republican Senator STUNS In Town Hall, Admits GOP’s ObamaCare Repeal Will Fail (TWEETS)
 Medicaid Directors Of All 50 States Issue Joint Statement Slamming GOP Health Bill
 Trump Regrets, Move Over: ‘Sassy Gay Republican’ Is All Of The Healthcare Angst We Need Right Now
 Obama Just Made A VERY Powerful Statement About Trump’s Attempts To Repeal Obamacare (VIDEO)
 Trump Vows To Save America From ‘Curse’ Of Functional Health Care System
 Fed Up With Congress, Trump Just Put A Big Nail In Obamacare’s Coffin
 Want To Ride On Air Force One As A Senator? Then Be Prepared To Vote For Trump’s Health Care Bill
 Trump Is Raising Your Healthcare Premiums, And That’s A Fact
 Republican Senator Predicts Trump’s Next Big Legislative Push Will Fail Just Like Trumpcare
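The two cells above repeat the same filter for the true and fake corpora. A small helper along the following lines could replace both; `docs_in_topic` is a hypothetical name, and it works on the plain lists returned by `fit_transform` without building a data frame.

```python
def docs_in_topic(topics, docs, topic_id, limit=16):
    """Return up to `limit` documents assigned to `topic_id`, in corpus order."""
    selected = [doc for topic, doc in zip(topics, docs) if topic == topic_id]
    return selected[:limit]


# Toy labels and titles, used only to show the call.
example_topics = [4, 0, 4, -1, 4]
example_docs = ["a", "b", "c", "d", "e"]
print(docs_in_topic(example_topics, example_docs, topic_id=4, limit=2))
```

With the notebook's variables this would read `docs_in_topic(topics_fake, corpus_fake, 4)`.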

From the above, we see that titles from the Boiler Room podcast (conspiracy theories?) should be removed during preprocessing! See https://alternatecurrentradio.com/category/boiler-room/, https://www.youtube.com/@AlternateCurrentRadio/featured, https://alternatecurrentradio.com/voodoo-nipple-calculus/, and https://alternatecurrentradio.com/world-war-mrna/.
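One possible way to do this filtering is sketched below. It assumes the podcast episodes can be recognised by the phrase "boiler room" appearing in the title, which has not been checked against the dataset. `drop_titles_containing` is a hypothetical helper, and the titles in the example are placeholders, not rows from the Kaggle files.

```python
def drop_titles_containing(titles, phrase="boiler room"):
    """Return the titles that do not contain `phrase` (case-insensitive)."""
    needle = phrase.casefold()
    return [title for title in titles if needle not in title.casefold()]


# Placeholder titles, used only to show the call.
example_titles = [
    "BOILER ROOM: placeholder episode title",
    "Placeholder headline about tax reform",
    "Boiler Room - another placeholder episode",
]
print(drop_titles_containing(example_titles))
```

Applied before fitting, this would look like `corpus_fake = drop_titles_containing(corpus_fake)`.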
